Learning to Separate Object Sounds by Watching Unlabeled Video
Authors
Abstract
Perceiving a scene most fully requires all the senses. Yet modeling how objects look and sound is challenging: most natural scenes and events contain multiple objects, and the audio track mixes all the sound sources together. We propose to learn audio-visual object models from unlabeled video, then exploit the visual context to perform audio source separation in novel videos. Our approach relies on a deep multi-instance multi-label learning framework to disentangle the audio frequency bases that map to individual visual objects, even without observing/hearing those objects in isolation. We show how the recovered disentangled bases can be used to guide audio source separation to obtain better-separated, object-level sounds. Our work is the first to study audio source separation in large-scale general “in the wild” videos. We obtain state-of-the-art results on visually-aided audio source separation and audio denoising.
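As a rough illustration of how per-object frequency bases could guide separation, the sketch below assumes the bases come from non-negative matrix factorization (NMF) of a magnitude spectrogram; the abstract itself does not name the factorization. The function names `nmf` and `separate_with_bases`, and the soft-mask reconstruction, are hypothetical choices made for this example, not the authors' implementation. The multi-instance multi-label step that assigns bases to visual objects is not shown; the per-object bases are simply taken as given.

```python
import numpy as np


def nmf(V, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (freq x time) as W @ H.

    Uses multiplicative updates for the Euclidean objective.
    Columns of W are frequency bases; rows of H are their activations.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def separate_with_bases(V, bases_per_object, n_iter=200, seed=0, eps=1e-9):
    """Split spectrogram V into one spectrogram per object.

    bases_per_object: list of (freq x k_i) non-negative arrays, assumed to
    have been associated with individual objects beforehand.
    The bases stay fixed; only the activations are fitted to V.
    """
    W = np.concatenate(bases_per_object, axis=1)
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)

    total = W @ H + eps  # full reconstruction of the mixture
    separated, start = [], 0
    for B in bases_per_object:
        k = B.shape[1]
        part = B @ H[start:start + k]  # this object's share of the reconstruction
        separated.append(V * part / total)  # soft mask applied to the mixture
        start += k
    return separated
```

In practice `V` would be the magnitude of a short-time Fourier transform of the mixed audio track, and each separated magnitude would be combined with the mixture phase before inverting back to a waveform.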
Similar resources
Time-Contrastive Networks: Self-Supervised Learning from Video
We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that ca...
Anticipating the future by watching unlabeled video
In many computer vision applications, machines will need to reason beyond the present, and predict the future. This task is challenging because it requires leveraging extensive commonsense knowledge of the world that is difficult to write down. We believe that a promising resource for efficiently obtaining this knowledge is through the massive amounts of readily available unlabeled video. In th...
The Role of Avatar in Interactive Fictional World of Video Games
In third-person video games, players are able to move and progress in the interactive world of the game while watching their avatar from an external point of view. The purpose of this paper is to investigate the role of avatar in the interactive imaginary world of video games using double vision theory. This article is based on descriptive-analytical methods and the use of library data and imag...
Large-Scale Object Discovery and Detector Adaptation from Unlabeled Video
We explore object discovery and detector adaptation based on unlabeled video sequences captured from a mobile platform. We propose a fully automatic approach for object mining from video which builds upon a generic object tracking approach. By applying this method to three large video datasets from autonomous driving and mobile robotics scenarios, we demonstrate its robustness and generality. B...
Visualizing Video Sounds through Sound Word Animation
Sound information in video plays an important role in constructing audience experience. On the other hand, there are many circumstances where the audience cannot watch video with sounds. Subscripts are conventionally used as visual aids to provide the missing sound information. However, conventional subscripts are far less expressive for non-verbal sounds since it is designed to visualize speec...